| Objective | Complete |
|---|---|
| Transform and prepare data for creating visualizations | |
| Create simple plots using Bokeh | |
Using the pathlib library, let main_dir be the variable corresponding to your course materials folder and data_dir be the variable corresponding to your data folder.
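A minimal sketch of how these two directory variables might be set with pathlib. The folder names `course-materials` and `data` are assumptions for illustration; substitute the locations used on your own machine.

```python
from pathlib import Path

# Assumption: the course materials live in a folder named "course-materials"
# under the home directory; replace this with your own location.
main_dir = Path.home() / "course-materials"

# Assumption: the data sits in a "data" subfolder of the main directory.
data_dir = main_dir / "data"

print(data_dir)
```

Building paths with the `/` operator keeps the code portable across operating systems, since pathlib inserts the correct separator.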
To implement everything we learn in this course, we will use the healthcare-dataset-stroke-data.csv dataset.
We will work with columns such as age, avg_glucose_level, hypertension, heart_disease, smoking_status, and stroke.
We will use different columns of the dataset to analyze the stroke data.
Use read_csv to read in the healthcare-dataset-stroke-data.csv dataset:
id gender age ... bmi smoking_status stroke
0 9046 Male 67.0 ... 36.6 formerly smoked 1
1 51676 Female 61.0 ... NaN never smoked 1
2 31112 Male 80.0 ... 32.5 never smoked 1
3 60182 Female 49.0 ... 34.4 smokes 1
4 1665 Female 79.0 ... 24.0 never smoked 1
[5 rows x 12 columns]
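The loading step can be sketched as follows. To keep the example self-contained, it reads from an in-memory stand-in that holds only the columns and first three rows visible in the output above; the name `df_sample` is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the real file: only the columns and rows visible in the output above.
sample_csv = io.StringIO(
    "id,gender,age,bmi,smoking_status,stroke\n"
    "9046,Male,67.0,36.6,formerly smoked,1\n"
    "51676,Female,61.0,,never smoked,1\n"
    "31112,Male,80.0,32.5,never smoked,1\n"
)

# read_csv accepts a file path or any file-like object.
df_sample = pd.read_csv(sample_csv)

# The empty bmi field in the second row is parsed as NaN.
print(df_sample.head())
```

With the real dataset, the same call would presumably take the path to the file instead, for example `pd.read_csv(data_dir / "healthcare-dataset-stroke-data.csv")`.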
df = df[['age', 'avg_glucose_level', 'heart_disease', 'ever_married', 'hypertension', 'Residence_type', 'gender', 'smoking_status', 'work_type', 'stroke']]
print(df.head())
age avg_glucose_level ... work_type stroke
0 67.0 228.69 ... Private 1
1 61.0 202.21 ... Self-employed 1
2 80.0 105.92 ... Private 1
3 49.0 171.23 ... Private 1
4 79.0 174.12 ... Self-employed 1
[5 rows x 10 columns]
# If the target is not binary, calculate its mean and assign values above the mean to 1 and values below it to 0
print(df['stroke'].value_counts())
0 4861
1 249
Name: stroke, dtype: int64
stroke is already binary, so we do not need to convert it.
id age avg_glucose_level ... work_type stroke
0 67.0 228.69 ... Private 1
1 61.0 202.21 ... Self-employed 1
2 80.0 105.92 ... Private 1
[3 rows x 10 columns]
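For a target that is not already binary, the mean-threshold rule described in the comment could look roughly like the sketch below. It borrows the first five avg_glucose_level values shown earlier purely as a stand-in numeric column; the column name `above_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

# First five avg_glucose_level values from the output shown earlier,
# used here only as a stand-in for a non-binary target.
toy = pd.DataFrame({"avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12]})

# The mean of the column is the cut-off.
threshold = toy["avg_glucose_level"].mean()

# Values above the mean become 1, everything else becomes 0.
toy["above_mean"] = np.where(toy["avg_glucose_level"] > threshold, 1, 0)

print(toy)
```

Here the mean is about 176.43, so only the first two rows are assigned 1.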
age float64
avg_glucose_level float64
heart_disease int64
ever_married object
hypertension int64
Residence_type object
gender object
smoking_status object
work_type object
stroke int64
dtype: object
0 4861
1 249
Name: stroke, dtype: int64
age 0
avg_glucose_level 0
heart_disease 0
ever_married 0
hypertension 0
Residence_type 0
gender 0
smoking_status 1544
work_type 0
stroke 0
Target_class 0
dtype: int64
age 0.000000
avg_glucose_level 0.000000
heart_disease 0.000000
ever_married 0.000000
hypertension 0.000000
Residence_type 0.000000
gender 0.000000
smoking_status 30.215264
work_type 0.000000
stroke 0.000000
Target_class 0.000000
dtype: float64
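Counts and percentages like the two outputs above are typically produced with `isnull()`. The sketch below assumes that approach and uses a small frame built from the age and bmi values shown earlier; the names `toy`, `na_count`, and `na_percent` are illustrative.

```python
import numpy as np
import pandas as pd

# Small frame built from the first four age and bmi values shown earlier.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Number of missing values per column.
na_count = toy.isnull().sum()

# Share of missing values per column, as a percentage.
na_percent = toy.isnull().mean() * 100

print(na_count)
print(na_percent)
```

The same arithmetic is consistent with the figures above: 1544 missing smoking_status values out of 5110 rows is about 30.2%.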
# Delete columns in which 50% or more of the values are NaN
perc = 50.0
min_count = int(((100-perc)/100)*df.shape[0] + 1)
df = df.dropna(axis=1, thresh=min_count)
print(df.shape)
(5110, 11)
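To see how the `thresh` argument behaves, here is a small sketch on a made-up four-row frame; the column names are hypothetical. `thresh` is the minimum number of non-NaN values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],   # 25% missing
    "half_missing": [np.nan, np.nan, 3.0, 4.0],  # 50% missing
})

perc = 50.0
# For 4 rows this gives int(0.5 * 4 + 1) = 3 required non-NaN values.
min_count = int(((100 - perc) / 100) * toy.shape[0] + 1)

# Columns with fewer than min_count non-NaN values are dropped.
kept = toy.dropna(axis=1, thresh=min_count)

print(kept.columns.tolist())
```

For the full dataset of 5110 rows, the same formula gives a threshold of 2556 non-NaN values, so no column is dropped and the shape stays (5110, 11).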
# Function to impute NA in both numeric and categorical columns
def fillna(df):
    # Fill numerical columns with the column mean
    numerical_columns = df.select_dtypes(include=['number'])
    numerical_columns = numerical_columns.fillna(numerical_columns.mean())
    # Fill categorical columns with the mode (most frequent value)
    categorical_columns = df.select_dtypes(exclude=['number'])
    categorical_columns = categorical_columns.fillna(categorical_columns.mode().iloc[0])
    # Combine the numerical and categorical columns back into a single DataFrame (numeric columns first)
    filled_df = pd.concat([numerical_columns, categorical_columns], axis=1)
    return filled_df
df = fillna(df)

| Objective | Complete |
|---|---|
| Transform and prepare data for creating visualizations | ✔ |
| Create simple plots using Bokeh | |
Use factor_mark() to display different markers for different categories in the input data, and factor_cmap() to color map those same categories.

| Objective | Complete |
|---|---|
| Transform and prepare data for creating visualizations | ✔ |
| Create simple plots using Bokeh | ✔ |
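As a sketch of how factor_mark() and factor_cmap() might be combined in a single scatter plot, the example below uses the first five rows of age, avg_glucose_level, and work_type shown earlier. The marker shapes, colors, and plot title are assumptions chosen for illustration.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import factor_cmap, factor_mark

# First five rows of age, avg_glucose_level and work_type from the data shown earlier.
source = ColumnDataSource(data={
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12],
    "work_type": ["Private", "Self-employed", "Private", "Private", "Self-employed"],
})

# Assumption: these marker and color choices are illustrative only.
WORK_TYPES = ["Private", "Self-employed"]
MARKERS = ["circle", "triangle"]
COLORS = ["#1f77b4", "#ff7f0e"]

p = figure(title="Average glucose level by age",
           x_axis_label="age", y_axis_label="avg_glucose_level")

# Each work_type category gets its own marker shape and its own color.
p.scatter(
    "age", "avg_glucose_level", source=source, size=12,
    marker=factor_mark("work_type", MARKERS, WORK_TYPES),
    color=factor_cmap("work_type", COLORS, WORK_TYPES),
    legend_group="work_type",
)
```

To render the plot, import `show` from `bokeh.plotting` and call `show(p)`.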
You are now ready to try tasks 3-8 in the Exercise for this topic